Some Practical Suggestions for Performing Gaussian Benchmarks on a pSeries 690 System
Abstract
Gaussian98 is a connected series of programs from Gaussian, Inc., that can perform a variety of semi-empirical, ab initio, and density functional theory calculations. For more than 20 years, the Gaussian program has been extensively used at universities and in the pharmaceutical and chemical industries to carry out basic research in the simulation and elucidation of new pharmaceuticals or materials. Gaussian has introduced new algorithmic improvements that leverage the latest IBM eServer architectures, in particular the pSeries. In this study, we present a series of benchmarks to analyze the performance of Gaussian as a function of multiple input parameters. This study also provides a guide to optimizing system resource management on IBM pSeries systems.

Introduction

The IBM systems based on the POWER processor were introduced in February of 1990 and were based on a multiple-chip implementation of the POWER architecture [1,2,4]. This technology is now commonly referred to as POWER1. The models introduced included an 8 KB instruction cache (I-cache) and either a 32 KB or 64 KB data cache (D-cache). They had a single floating-point unit (FPU) capable of issuing one compound floating-point multiply-add (FMA) operation each cycle, with a latency of only two cycles. Therefore, the peak MFLOPS rate was equal to twice the MHz rate.

Announced in September 1993, the Model 590 was the first RS/6000 machine based on the POWER2 architecture [1,2,4]. The most significant improvement introduced with the POWER2 architecture was that the FPU contained two 64-bit execution units, so that two FMA instructions could be executed in each cycle. In addition, a second fixed-point execution unit and several new hardware instructions were added:

- Quad-word storage instructions: the quad-word load instruction moves two adjacent double-precision values into two adjacent floating-point registers.
- Hardware square root instruction.
- Floating-point to integer conversion instruction.

Although the Model 590 ran with only a marginally faster clock than the POWER1-based Model 580, the architectural improvements listed above, combined with a larger 256 KB D-cache, enabled it to achieve a far greater level of performance. As an example, the MFLOPS achieved in the Linpack TPP benchmark went from 104 to 237.

In October 1996, IBM announced the RS/6000 Model 595 [1,2,4]. This was the first machine based on the P2SC (POWER2 Super Chip) processor, a single-chip implementation of the POWER2 architecture that enabled the clock speed to be increased further. The Model 595 runs at 135 MHz, and the fastest P2SC processors, found in the Model 397 workstation and RS/6000 SP Thin4 nodes, run at 160 MHz with a theoretical peak performance of 640 MFLOPS.

The POWER3 [3] microprocessor introduced a new generation of 64-bit processors especially designed for high-performance and visual computing applications. POWER3 processors replaced the POWER2 and P2SC in high-end RS/6000 workstations and technical servers.

The first RS64 processor was introduced in September of 1997 and was the first step into 64-bit computing for the RS/6000 [1,2,4]. While the POWER2 products had strong floating-point performance, this series of products emphasized strong commercial server performance.
It ran at 125 MHz with a 2-way set-associative 4 MB L2 cache and had a 64 KB L1 instruction cache, a 64 KB L1 data cache, one floating-point unit, one load-store unit, and one integer unit. Systems were designed to use up to 12 processors. pSeries products using the RS64 were the first pSeries products to have the same processor and memory system as iSeries products.

In September 1998, the RS64-II was introduced. It was a different design from the RS64 and increased the clock frequency to 262 MHz. The L2 cache became 4-way set associative with an increase in size to 8 MB. It had a 64 KB L1 instruction cache, a 64 KB L1 data cache, one floating-point unit, one load-store unit, two integer units, and a short in-order pipeline optimized for conditional branches.

With the introduction of the RS64-III in the fall of 1999 [4], this design was modified to use copper technology, achieving a clock frequency of 450 MHz, with the L1 instruction and data caches increased to 128 KB each. This product also introduced hardware multithreading for use by AIX. Systems were designed to use up to 24 processors. In the fall of 2000, this design was enhanced to use silicon-on-insulator (SOI) technology, enabling the clock frequency to be increased to 600 MHz. The L2 cache size was increased to 16 MB on some models. Continued development of this design provided processors running at 750 MHz. The most recent version of this microprocessor was called the RS64-IV.

The new POWER4 processor continues the evolution [4]. The POWER4 processor chip contains two microprocessor cores, chip and system pervasive functions, core interface logic, a 1.41 MB level-2 (L2) cache and controls, the level-3 (L3) cache directory and controls, and the fabric controller that controls the flow of information and control data between the L2 and L3 and between chips. Table 1 provides a processor comparison.

Table 1. Processor Comparison

                         POWER3    RS64-III   POWER4
  Frequency [MHz]        375       450        1,300
  Fixed Point Units      3         2          2
  Floating Point Units   2         1          2
  Load/Store Units       2         1          2
  Branch/Other Units     1         1          2
  Dispatch Width         4         4          5
  Branch Prediction      Dynamic   Static     Dynamic
  I-cache size [KB]      32        128        64
  D-cache size [KB]      128       128        32
  L2-cache size [MB]     8         16         1.41 (a)
  L3-cache size [MB]     N/A       N/A        512 (b)
  Data Prefetch          Yes       No         Yes

  (a) Shared between two cores. (b) Shared between 32 cores.

Each microprocessor contains a 64 KB level-1 instruction cache, a 32 KB level-1 data cache, two fixed-point execution units, two floating-point execution units, two load/store execution units, one branch execution unit, and one execution unit to perform logical operations on the condition register. Instructions are dispatched in groups in program order and are issued out of program order to the execution units, with a bias towards the oldest operations first. Groups can consist of up to five instructions and are always terminated by a branch instruction. The processors on the first IBM POWER4-equipped servers, the IBM pSeries 690 Model 681 servers, operate at either 1100 MHz or 1300 MHz. Table 2 compares SPECint2000 and SPECfp2000 for the latest POWER-based processors.

Table 2. Comparative Metrics (a)

  Metric        POWER3 450 MHz   RS64-III 450 MHz   POWER4 1300 MHz
  SPECint2000   335              234                814
  SPECfp2000    433              210                1,169

  (a) http://www.spec.org

The combined progress of software and hardware is what makes it possible to carry out calculations that just a few years ago were unthinkable.
An RHF/6-31G** single-point energy calculation on triamino-trinitro-benzene (TATB) using Gaussian88 on an IBM RS/6000 Model 550 used to take 4.5 hours [5]. The same calculation using Gaussian92 took about 1 hour [5]. Today this calculation takes less than 2 minutes on the latest IBM POWER3 systems.

In this study, we consider a series of benchmarks, not only to compare the POWER3 systems against the POWER4, but also to look at the performance of the POWER4 HPC (code name Regatta-HPC) versus the POWER4 Turbo (code name Regatta-H). In the next section we summarize the design features of the systems employed in this study. We also devote a section to how Gaussian has been parallelized. In the last four sections we describe the benchmarks and discuss the results in terms of parallel speedup and efficiency. A summary is provided that we hope will serve as guidance for running or selecting appropriate parallel benchmarks.

Design Features

Three different IBM pSeries servers were used in this study: a 375 MHz POWER3-II symmetric multiprocessor (SMP) and two pSeries 690 servers. One corresponds to the 1.3 GHz POWER4 pSeries 690 HPC (High Performance Computing) and the other to the 1.3 GHz POWER4 pSeries 690 Turbo. For simplicity, we will refer to them as the HPC system and the Turbo system.

The pSeries 690 server is the latest UNIX server from IBM and provides a new architecture [4,6]. At the core of its architecture is the POWER4 Multi-chip Module (MCM). The building blocks for the systems utilized here are an 8-way MCM Turbo running at 1.3 GHz and a 4-way MCM HPC running at 1.3 GHz. One of the distinguishing features of the HPC system (4-way MCM) is that it is optimized for data-intensive applications that require larger memory bandwidth per core: each POWER4 HPC processor chip contains one core rather than two, so the L2 cache is dedicated to a single core. The Turbo system, on the other hand, has two cores sharing each L2 cache; hence one MCM is 8-way.

A full description of the POWER4 architecture is beyond the scope of this work; for further details, see refs. [4] and [6]. However, in this section we provide an overview of the most important features of this architecture. Each processor chip on the pSeries 690 consists of either one or two cores, an L2 cache that runs at the same speed as the microprocessor, the microprocessor interface unit (the interface between each microprocessor and the rest of the system), the directory and cache controller for the L3 cache, the fabric bus controller, and a GX bus controller that enables I/O devices to connect to the Central Electronic Complex (CEC). The L3 cache is a new component not available on the POWER3 architecture; the L3 caches are mounted on a separate module.

The 375 MHz POWER3-II system consisted of 16 processors, 24 GB of memory, and an 8 MB L2 cache. The 1.3 GHz Turbo system consisted of 32 processors and 8 x 32 GB memory cards, for a total of 256 GB of memory. The 1.3 GHz HPC system also had 8 x 32 GB memory cards, for a total of 256 GB of memory.

All the timings in this work correspond to elapsed time. Timings were measured using the times printed by each link when the #p option is used; Gaussian prints both the wall-clock time and the CPU time. This information was used as input for a utility program that tabulates timings and speedups.
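The tabulation utility itself is not listed in this report; the short Python sketch below only illustrates the idea. It assumes (and this is an assumption about the log format, which differs between Gaussian versions) that with #p in the route each link ends with a line resembling "Leave Link  502 at Wed Feb 13 12:34:56 2002, MaxMem= 6291456 cpu: 123.5", estimates the elapsed time per link by differencing successive timestamps, and reports the speedup of a parallel run relative to a serial run.

    import re
    import sys
    from datetime import datetime

    # Hypothetical helper in the spirit of the tabulation utility mentioned in
    # the text.  The "Leave Link" pattern below is an assumed format; adjust the
    # regular expression to match the output of your Gaussian version.
    LEAVE = re.compile(
        r"Leave Link\s+(\d+)\s+at\s+(\w{3}\s+\w{3}\s+\d+\s+\d+:\d+:\d+\s+\d{4})"
        r".*?cpu:\s*([\d.]+)"
    )

    def link_times(logfile):
        """Return a list of (link, elapsed_seconds, cpu_seconds) entries."""
        entries, previous = [], None
        with open(logfile) as fh:
            for line in fh:
                m = LEAVE.search(line)
                if not m:
                    continue
                link = "l" + m.group(1)
                stamp = datetime.strptime(m.group(2), "%a %b %d %H:%M:%S %Y")
                # Elapsed time is attributed to the link that just finished;
                # the very first link cannot be timed this way and gets 0.
                elapsed = (stamp - previous).total_seconds() if previous else 0.0
                entries.append((link, elapsed, float(m.group(3))))
                previous = stamp
        return entries

    if __name__ == "__main__":
        serial_log, parallel_log, nproc = sys.argv[1], sys.argv[2], int(sys.argv[3])
        t_serial = sum(e for _, e, _ in link_times(serial_log))
        t_parallel = sum(e for _, e, _ in link_times(parallel_log))
        speedup = t_serial / t_parallel
        print(f"elapsed: serial {t_serial:.1f} s, parallel {t_parallel:.1f} s")
        print(f"speedup on {nproc} processors: {speedup:.2f} "
              f"(efficiency {speedup / nproc:.2f})")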
Parallel Gaussian

Gaussian [7], a connected series of programs, can be used to perform different kinds of electronic structure calculations, for example semi-empirical, ab initio, and density functional theory calculations. Gaussian consists of a collection of programs commonly known as links; the links communicate through disk files. Links are grouped into overlays [8]. Links are independent executables located in the g98 directory and labeled lxxx.exe, where xxx is the unique number of each link. In general, overlay zero is responsible for starting the program, which includes reading the input file. After the input file is read, the route card (the keywords and options that specify all the Gaussian parameters) is translated into a sequence of links. Overlay 99 (l9999.exe) terminates the run; in most cases l9999.exe finishes with an archive entry (a brief summary of the calculation).

As previously pointed out, the Gaussian architecture is basically the same on shared-addressable and distributed-memory systems [9]. Each link is responsible for continuing the sequence of links by invoking the exec() system call to run the next link. The links that have been parallelized will run on multiple processors. The links that run sequentially are mainly those responsible for setting up the calculation and assigning symmetry. Although in previous publications we have summarized all the links that run in parallel [9], it is important to note that the calculations that benefit from the use of multiple processors are those that make use of the PRISM routines [10]. In a self-consistent field (SCF) scheme, the two-electron integrals are part of the Fock matrix [11]:

    F_{\mu\nu} = H_{\mu\nu} + \sum_{\lambda\sigma} P_{\lambda\sigma} \left[ (\mu\nu|\lambda\sigma) - \tfrac{1}{2} (\mu\lambda|\nu\sigma) \right]    (1)

where H represents the core Hamiltonian, μ, ν, λ, and σ are atomic orbital indices, and the quantities (μν|λσ) are two-electron repulsion integrals. In Gaussian, these quantities are computed once and stored, or recomputed many times as needed, depending on the memory available and the algorithm chosen. In density functional theory (DFT), eq. (1) can be rewritten by replacing the last term with the well-known exchange-correlation term F^XC [12]:

    F_{\mu\nu} = H_{\mu\nu} + \sum_{\lambda\sigma} P_{\lambda\sigma} (\mu\nu|\lambda\sigma) + F^{XC}_{\mu\nu}    (2)

In Gaussian, the two-electron integrals are parallelized by distributing batches of integrals among all the available processors (normally selected using the %nproc keyword in the input file). This procedure is illustrated in Figure 1. The main loop over NPROCS (the number of available processors) determines the number of integrals that will be distributed among the processors. Each task looks at 1/NPROCS of the shell quartets, discarding those that do not need to be done because of symmetry, cutoffs, or the fact that they have already been done. All the integrations are carried out in these loops. Within this series of loops, and still inside the main NPROCS loop, the last loop sums up the contributions to the Fock matrix, derivative matrices, or density matrices, depending on the type of integrals computed. The next loop, outside the first NPROCS loop, is another NPROCS loop that adds all the contributions to the Fock matrix together in a serial block of code. This scheme can be exploited to compute first and second two-electron integral derivatives, first and second one-electron integral derivatives, and electrostatic potential integrals.
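To make the scheme concrete, the toy Python sketch below mimics this structure: the (λ,σ) index pairs of eq. (1) stand in for the batches of shell quartets, one child process per "processor" is created with fork(), each child accumulates a partial Fock matrix, and the parent then adds the partial matrices together in a serial loop, as in Figure 1 below. This is only an illustration under stated assumptions and is not Gaussian's code: random arrays replace real integrals, the batching ignores angular momentum and contraction, and a pipe is used to return each child's result to the parent, which is not necessarily how Gaussian collects its contributions.

    import os
    import numpy as np

    # Toy sketch only, not Gaussian source code.  Random arrays stand in for the
    # core Hamiltonian, density matrix, and two-electron integrals of eq. (1).
    NPROCS = 4
    nbf = 10                                   # number of basis functions
    rng = np.random.default_rng(0)
    H = rng.standard_normal((nbf, nbf))        # core Hamiltonian H_{mu nu}
    P = rng.standard_normal((nbf, nbf))        # density matrix P_{lambda sigma}
    eri = rng.standard_normal((nbf,) * 4)      # integrals (mu nu | lambda sigma)

    # Deal the (lambda, sigma) pairs out so each task handles 1/NPROCS of them,
    # playing the role of the shell-quartet batches in Figure 1.
    pairs = [(l, s) for l in range(nbf) for s in range(nbf)]
    batches = [pairs[k::NPROCS] for k in range(NPROCS)]

    def partial_fock(batch):
        """Coulomb minus one-half exchange contribution for one batch, eq. (1)."""
        F = np.zeros((nbf, nbf))
        for l, s in batch:
            F += P[l, s] * (eri[:, :, l, s] - 0.5 * eri[:, l, :, s])
        return F

    children = []
    for k in range(NPROCS):                    # parallel loop over NPROCS
        r, w = os.pipe()
        pid = os.fork()                        # create one child per batch
        if pid == 0:                           # child: do 1/NPROCS of the work
            os.close(r)
            os.write(w, partial_fock(batches[k]).tobytes())
            os.close(w)
            os._exit(0)
        os.close(w)                            # parent keeps only the read end
        children.append((pid, r))

    F = H.copy()
    for pid, r in children:                    # sequential loop over NPROCS:
        with os.fdopen(r, "rb") as fh:         # add the partial Fock matrices
            F += np.frombuffer(fh.read(), dtype=np.float64).reshape(nbf, nbf)
        os.waitpid(pid, 0)                     # wait for the child to finish

    # Check against a straightforward serial evaluation of eq. (1).
    F_ref = (H + np.einsum("ls,mnls->mn", P, eri)
               - 0.5 * np.einsum("ls,mlns->mn", P, eri))
    print("max |F - F_ref| =", np.abs(F - F_ref).max())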
    loop over NPROCS
       loop over total angular momentum
          loop over degrees of bra and ket contraction
             do integrals for 1/NPROCS of shell quartets
          endloop
       endloop
       add integral contributions to partial Fock matrix
    endloop
    loop over NPROCS (sequential code)
       add 1/NPROCS Fock matrix contributions
    endloop

Figure 1. Parallel loops to compute two-electron integrals in Gaussian.

The parallelization of the Fock matrix in Gaussian98 on IBM shared-memory architectures has been accomplished by using the standard UNIX fork and wait commands [13]. fork() creates a new process. Upon successful completion, the fork subroutine returns a value of 0 to the child process and returns the process ID of the child process to the parent process. Otherwise, a value of -1 is returned to the parent process and no child process is created. In Gaussian98, fork() is called within the NPROCS loop and each new child executes a batch of integrals. All the children must have completed their tasks before the sequential loop that adds up their contributions is entered.

Selected Benchmarks

As in previous studies [9,13], we consider four major characteristics when studying parallel performance on the IBM SP or selecting benchmarks: job type, theoretical method, basis set size, and molecular size. The job type corresponds to a single-point energy, a gradient calculation, or a calculation of second derivatives. Normally, single-point calculations are used to compute accurate energies at a level of theory that is too expensive for a full geometry optimization of a medium-to-large system. Because of the importance of molecular structure in chemistry, a large majority of calculations are geometry optimizations using Hartree-Fock or density functional theory. This type of calculation is normally followed by a frequency calculation. To a lesser extent, geometry optimizations are also carried out at the MP2 level of theory.

There are many options for carrying out calculations using Gaussian. In this study we tried to select job types that reflect how users are currently running Gaussian. These are by no means the only options used in the program, but they represent a large percentage of the calculations carried out at computer centers. They also correspond to many of the benchmarks carried out for hardware procurements. We have chosen the following types of calculations: single-point energy (SP) calculations at different levels of theory; FORCE calculations, which correspond to an SP calculation followed by the calculation of the first derivatives of the energy with respect to the positions of the atoms in the molecule (the time required for a geometry optimization is a multiple of the time needed for a FORCE calculation); and, as the third type, frequency calculations. Rather than doing a full geometry optimization, a FORCE calculation is recommended for benchmarks that involve hardware performance. It is equivalent to doing one cycle of the optimization and should provide a good approximation to the performance of an optimization calculation.

A large number of approximate theoretical methods have been reported in the literature [11]. These methods range in accuracy and computational cost. Since Gaussian provides most of these methods, it is important to understand how they perform as a function of system resources.
In this study, we refer to system resources as the number of processors, the memory, and the disk space needed for optimal performance. The theoretical methods chosen in this study have been extensively discussed in the literature [11] and it is beyond the scope of this work to describe them. The approximations used in this work correspond to Hartree-Fock [11], the three-parameter density functional theory of Becke (B3-LYP) [14], and Configuration Interaction Singles (CIS) energies and gradients [15].

The cases used in this study correspond to a subset selected from a previous study [9], augmented with a FORCE calculation on Valinomycin. The system used for cases I and III is α-pinene. Case I is an SP calculation at the HF level of theory using the 6-311G(df,p) basis set. Case II corresponds to an SP FORCE calculation on a fairly large system (Valinomycin). The third case tested is an α-pinene frequency calculation at the B3-LYP/6-31G(d) level of theory. Cases II and III exercise several of the links that run in parallel. To better understand which links run in parallel, Table 3 shows the links that are executed during these calculations and the ones that run in parallel. Although each of these calculations executes several links, it is important to note that most of the CPU time is spent in the parallel links. This ensures good scalability.

Table 3. List of Links Executed in Cases I-IV (a)

  Case I (b)   Case II (c)   Case III (d)   Case IV (e)
  l1.exe       l1.exe        l1.exe         l1.exe
  l101.exe     l101.exe      l101.exe       l101.exe
  l202.exe     l103.exe      l103.exe       l103.exe
  l301.exe     l202.exe      l202.exe       l202.exe
  l302.exe     l301.exe      l301.exe       l301.exe
  l303.exe     l302.exe      l302.exe       l302.exe
  l401.exe     l303.exe      l303.exe       l308.exe
  l502.exe     l401.exe      l401.exe       l303.exe
  l601.exe     l502.exe      l502.exe       l401.exe
  l9999.exe    l601.exe      l801.exe       l502.exe
               l701.exe      l1101.exe      l801.exe
               l702.exe      l1102.exe      l914.exe
               l703.exe      l1110.exe      l1002.exe
               l716.exe      l1002.exe      l601.exe
               l103.exe      l601.exe       l701.exe
               l9999.exe     l701.exe       l702.exe
                             l702.exe       l703.exe
                             l703.exe       l716.exe
                             l716.exe       l103.exe
                             l103.exe       l9999.exe
                             l9999.exe

  (a) Parallel links are in italics. (b) HF/6-311G(df,p). (c) B3-LYP/3-21G FORCE.
  (d) B3-LYP/6-31G(d) FREQ. (e) CIS/6-31++G FORCE.

Case IV computes excited states using a CI-singles FORCE calculation with the 6-31++G basis set [15] on acetyl phenol. This set of benchmarks represents small to large systems and tests speedup and efficiency for most of the parallel links. All the geometries are available from Ref. [16].

Results and Discussion

In this study, as in our previous reports, we look at performance in terms of speedup [9]. Speedup (S) is defined as the ratio of the serial run time (elapsed time, t_s) to the time that it takes to solve the same problem in parallel (elapsed time, t_p).
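In symbols, for a run on p processors, the speedup and the parallel efficiency quoted alongside it (using the standard definition of efficiency, which the text above does not spell out explicitly) are

    S(p) = \frac{t_s}{t_p}, \qquad E(p) = \frac{S(p)}{p}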